Sentence Boundary Disambiguation: A User Friendly Approach
Similar resources
Sentence Boundary Disambiguation: A User Friendly Approach
In the present work we have developed an algorithm based on maximum entropy and stop-word removal modules. It achieves almost 99% accuracy and outperforms the existing paragraph breaker software developed by the Text Mining Group, School of Computer Science, Manchester University, United Kingdom.
Adaptive Sentence Boundary Disambiguation
Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that us...
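The ambiguity described in this abstract can be illustrated with a minimal sketch. It is not the algorithm of any paper listed here; it only shows the kind of regular-expression splitting and exception rule that the abstract calls brittle. The function names and the abbreviation list are hypothetical, chosen for illustration.

```python
import re

# Hypothetical exception list; a real system would need a far larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def rule_based_split(text):
    """Apply the naive split, then re-join pieces ending in a listed abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].rstrip(".!?").lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "Dr. Smith arrived. He left early."
print(naive_split(text))       # splits wrongly after "Dr."
print(rule_based_split(text))  # the exception rule repairs this one case
```

Any abbreviation missing from the list (for example "p.m.") is still split incorrectly, which is the brittleness that motivates trainable approaches.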
A hybrid approach for Urdu sentence boundary disambiguation
Sentence boundary identification is a preliminary step in preparing a text document for Natural Language Processing tasks, e.g., machine translation, POS tagging, and text summarization. We present a hybrid approach for Urdu sentence boundary disambiguation comprising a unigram statistical model and a rule-based algorithm. After implementing this approach, we obtained 99.48% precision, 86.3...
Adaptive Multilingual Sentence Boundary Disambiguation
The sentence is a standard textual unit in natural language processing applications. In many languages the punctuation mark that indicates the end-of-sentence boundary is ambiguous; thus the tokenizers of most NLP systems must be equipped with special sentence-boundary recognition rules for every new text collection. As an alternative, this article presents an efficient, trainable system for se...
Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation
Urdu is a morphologically rich language whose characters differ in nature. Urdu text tokenization and sentence boundary disambiguation are more difficult than in a language like English. The major hurdle for tokenization is the improper use of space between words, whereas the absence of case discrimination makes sentence boundary detection a difficult task. In this paper some issues regarding b...
Journal
Journal title: International Journal of Computer Applications
Year: 2010
ISSN: 0975-8887
DOI: 10.5120/1269-1738